Abstract

We explore the feasibility of implementing a reli- 
able, high performance, distributed storage system 
on a commodity computing cluster. Files are dis- 
tributed across storage nodes using erasure coding 
with small Low-Density Parity-Check (LDPC) codes 
which provide high reliability while keeping the stor- 
age and performance overhead small. We present per- 
formance measurements done on a prototype system 
comprising 50 nodes which are self organised using a 
peer-to-peer overlay. 

1 Introduction 

With the growing popularity of Grid and cluster com- 
puting, computer clusters built out of cheap commod- 
ity computers are becoming commonplace. While 
their CPU power is readily made available through 
batch or grid computing systems, the often substan- 
tial amount of disk space on the computer nodes is 
usually not made available for mid- or long-term
storage. In this paper we investigate how to make this
space available as high-performance, reliable file
storage for applications where files are written once
and read often.

In erasure coding, a file is decomposed into n data
and m coding blocks of equal size and can be
reconstructed from any n blocks. Statistical analysis
of the availability of distributed files shows [?] that
when the hosts on which the data is stored are
relatively reliable, erasure coding ensures a much
higher availability than full file replication, while
introducing a smaller storage overhead. In practice,
however, most storage systems used on Local Area
Networks rely on replication to ensure reliability.
The reason is that traditional erasure codes such as
Reed-Solomon codes demand a high computational effort,
which grows quadratically with n and m, to reassemble
the original data out of any n data or coding blocks.

Low-Density Parity-Check (LDPC) codes [6] provide a
solution to this problem because they allow the
original data to be reconstructed using relatively few
and cheap XOR operations. In contrast to Reed-Solomon
codes they do not, however, code the data optimally:
they require f·n blocks to reconstruct the stored
file, where f > 1. The properties of LDPC codes are
well understood in the asymptotic limit n → ∞, where
f → 1, but little is known about how to construct
smaller codes (n, m < 1000). The discovery of
efficient algorithms for creating large LDPC codes
(n > 10000) with very fast encoding and decoding [12]
has led to a surge of interest in these codes, in
particular for the resilient storage of files on Grid
and peer-to-peer networks. In this scenario a file is
decomposed into many blocks (n + m large) which are
stored in a distributed manner on hosts connected by
a Wide Area Network (WAN) [?].

The paper is organised as follows. In section 2 we
compare the availability provided by LDPC codes with
that of optimal erasure coding and replication. We
explain why small LDPC codes (n, m ≈ 10) better fit
the criteria for implementing a distributed storage
system in a commodity computing cluster. These small
codes cannot be constructed with standard techniques.
We therefore present, in section 3, a way of
constructing graphs with good guarantees on their
redundancy using Monte Carlo techniques. In section 4
we describe the implementation of a file storage
system based on small LDPC codes. Performance
measurements are presented in section 5. The remainder
of the paper is a discussion of the implementation and
the results obtained so far, including references to
related work (section 6), and finally conclusions and
a preview of our ongoing and future work (section 7).

2 Availability Analysis of LDPC vs Erasure Coding vs Replication

In the following we give an overview of the storage
overhead and availability of normal erasure codes,
LDPC codes and common file replication. A detailed
study can be found in e.g. [11, 20].

Written in a common form, the availability of a file
replicated S times (S = (n + m)/n is called the
stretch factor) is given by equation (1), and that of
an erasure coded file by equation (2),
Figure 1: The average failure rate of files stored
redundantly using replication and using LDPC codes
versus a given storage overhead. We choose n = 8
for the LDPC codes and assume an overhead factor
of f = 1.1.



where μ is the availability of a host. The rate of the
codes is

\[ R = \frac{1}{S} = \frac{n}{n + m} \tag{3} \]
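Equations (1) and (2) are not legible in this copy. As a hedged reconstruction, assuming that hosts fail independently and that each is available with probability μ, the standard binomial expressions for the two cases would read as follows. The symbols A_rep and A_ec are introduced here for illustration, and the authors' exact form may differ:

```latex
\begin{align}
A_{\mathrm{rep}} &= \sum_{i=1}^{S} \binom{S}{i}\,\mu^{i}\,(1-\mu)^{S-i}
                  \;=\; 1 - (1-\mu)^{S}, \\
A_{\mathrm{ec}}  &= \sum_{i=n}^{n+m} \binom{n+m}{i}\,\mu^{i}\,(1-\mu)^{n+m-i}.
\end{align}
```

Both are tail sums of a binomial distribution: replication needs at least one of S copies to be reachable, erasure coding at least n of the n + m blocks.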

LDPC codes, which do not code optimally, introduce an
overhead factor f: on average they can only
reconstruct the original file from f·n chunks, where
f > 1. Concerning the overhead, LDPC codes are
therefore comparable to normal erasure codes with

\[ n' = f n \quad\text{and}\quad m' = (1 - f)\,n + m \tag{4} \]



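A guaranteed lower bound on the availability of an
LDPC encoded file (equivalently, an upper bound on its
failure rate) is obtained from the erasure-coding
expression by using f_max, the maximum overhead of a
graph, in place of f; f_max is defined such that the
original data can be reconstructed from any f_max·n
blocks.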

Figure 1 shows a comparison of the failure rate
(1 − A) of files stored redundantly using replication
with that of LDPC encoded files with n = 8 and an
assumed overhead factor of f = 1.1, for three
different node availabilities μ = 0.5, 0.95 and 0.99.
For poor node availability (μ = 0.5), LDPC codes with
such a small number of coding blocks perform worse
than file replication, at least for small storage
overhead factors of S < 1.5. However, for a good
availability of μ = 0.95 (an estimate for the
availability of nodes in a cluster of custom hardware)
or even 0.99, small LDPC codes can provide a much
better file availability at smaller overheads. This
means that small LDPC codes have the potential to
provide small storage overhead and excellent file
availability on LANs while introducing only a small
networking overhead, due to the relatively small
number n of parallel downloads.

Figure 2: An example of a systematic graph with
n = 3 and m = 2.
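The comparison made in this section can be reproduced numerically with a few lines of Python. This is a minimal sketch under stated assumptions: hosts fail independently, the LDPC code is modelled as an erasure code that needs n' = f·n blocks rounded up to a whole number, and all function names are illustrative rather than taken from the paper.

```python
from math import ceil, comb


def availability_replication(mu: float, copies: int) -> float:
    """Probability that at least one of `copies` full replicas is reachable."""
    return 1.0 - (1.0 - mu) ** copies


def availability_erasure(mu: float, n: int, m: int) -> float:
    """Probability that at least n of the n + m blocks are reachable."""
    total = n + m
    return sum(
        comb(total, i) * mu**i * (1.0 - mu) ** (total - i)
        for i in range(n, total + 1)
    )


def availability_ldpc(mu: float, n: int, m: int, f: float) -> float:
    """Treat the LDPC code as an erasure code that needs about f * n blocks."""
    needed = ceil(f * n)          # n' = f n, rounded up to whole blocks (assumption)
    spare = n + m - needed        # m' = (1 - f) n + m
    if spare < 0:
        return 0.0                # more blocks required than exist
    return availability_erasure(mu, needed, spare)


if __name__ == "__main__":
    # Stretch factor S = 2: two full replicas versus n = 8, m = 8 with f = 1.1.
    for mu in (0.5, 0.95, 0.99):
        fail_rep = 1.0 - availability_replication(mu, 2)
        fail_ldpc = 1.0 - availability_ldpc(mu, 8, 8, 1.1)
        print(f"mu={mu}: replication {fail_rep:.2e}, LDPC {fail_ldpc:.2e}")
```

Under these assumptions the sketch shows the same qualitative behaviour as described above: replication wins for unreliable hosts, the small LDPC code wins once hosts are reliable.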

3 LDPC Codes 

In the following we give a brief overview of LDPC
codes; for a full introduction see e.g. [2]. An
example of a graph describing a simple LDPC code is
shown in fig. 2. From n = 3 data words d1, d2, d3 (bits
in the simplest case), m = 2 coding words c1, c2 are
calculated by XORing the data words connected to them
in the graph, for example c1 = d1 + d2, where +
denotes XOR. Encoding the redundant information in
the coding words takes a time that grows linearly
with the number of edges in the graph. The data and
coding words are then assembled into data and coding
blocks and can be stored in a distributed manner.
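The encoding step can be sketched in a few lines of Python. The graph used here is partly an assumption: c1 = d1 + d2 is taken from the text, while c2 = d2 + d3 is a hypothetical second coding node chosen only to complete an n = 3, m = 2 example. The function names are illustrative.

```python
from functools import reduce

# Each entry lists the data blocks XORed into one coding block.
# c1 = d1 + d2 comes from the text; c2 = d2 + d3 is an assumed example.
GRAPH = [(0, 1), (1, 2)]


def xor_blocks(blocks):
    """Byte-wise XOR of a list of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def encode(data_blocks, graph=GRAPH):
    """Return one coding block per coding node of the graph."""
    return [xor_blocks([data_blocks[i] for i in edges]) for edges in graph]


if __name__ == "__main__":
    data = [b"\x01\x02", b"\x03\x04", b"\x05\x06"]
    print(encode(data))
```

The work done is one XOR per edge and per word, which is the linear growth in the number of edges mentioned above.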

The original information of a file can be
reconstructed directly from the data blocks by simply
concatenating them. This will be possible in the
majority of cases if they were distributed locally on
a relatively reliable LAN. If data blocks are
unavailable, they can be reconstructed from coding
blocks using the following algorithm: if, for a known
coding block, all but one of the data blocks from
which it has been calculated are known, then the words
in that unknown data block are the exclusive OR of the
corresponding words in the coding block and the known
data blocks. By applying this algorithm recursively to
the downloaded or reconstructed blocks, the original
data can be reconstructed in linear time, provided a
sufficient number of data and coding blocks is
available.
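The recovery rule described above can be sketched as follows. This is an illustrative version, not the prototype's code: it rescans all coding blocks on every pass for simplicity (so it is not the linear-time variant), it only handles coding blocks computed directly from data blocks, and the graph representation (a tuple of data indices per coding block) is an assumption.

```python
def xor_pair(a, b):
    """Byte-wise XOR of two equally sized blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


def peel(n, graph, data, coding):
    """Rebuild missing data blocks.

    n      -- number of data blocks
    graph  -- graph[j] is the tuple of data indices XORed into coding block j
    data   -- dict {data index: bytes} of the data blocks that are available
    coding -- dict {coding index: bytes} of the coding blocks that are available
    Returns the list of all n data blocks, or None if recovery is impossible.
    """
    data = dict(data)                      # do not modify the caller's dict
    progress = True
    while len(data) < n and progress:
        progress = False
        for j, block in coding.items():
            missing = [i for i in graph[j] if i not in data]
            if len(missing) != 1:
                continue                   # rule applies only with one unknown
            value = block
            for i in graph[j]:
                if i in data:
                    value = xor_pair(value, data[i])
            data[missing[0]] = value       # the unknown block is now known
            progress = True
    return [data[i] for i in range(n)] if len(data) == n else None
```

Each successful application of the rule may unlock further coding blocks, which is why the loop repeats until no progress is made.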

The amount of information encoded using an LDPC graph
is given by the rate R = n/(n + m). Since LDPC codes
do not encode optimally, more than n blocks are needed
to reconstruct the original file when blocks are
downloaded at random. The overhead factor f is defined
by the average number f·n of blocks which need to be
downloaded to reconstruct the file, where f > 1. In
the limit of very large codes (n, m → ∞), LDPC codes
become optimal (f → 1); for small and medium sized
codes (n, m < 10000) the overhead is typically of the
order of 10%, depending on R [14].

When using large LDPC codes for distributed storage on
WANs, codes with f as small as possible are needed,
which allows blocks to be picked for download based on
latency or available bandwidth. However, for the usage
we envision, that is the reliable and high-performance
storage of files on a LAN, a small f is not
necessarily required, since the availability of the
blocks will be good and the available bandwidth does
not differ between blocks (ignoring for now the
problem of "hot" files and blocks). In fact, for
performance reasons a client will in the normal case
try to download only the n data blocks, in order to
reconstruct the original file by simple concatenation.
However, we also want to give certain guarantees on
the availability of a file, which means we need to
know the worst-case overhead f_max, such that we can
guarantee a successful download of a file whenever
f_max·n blocks of the file are available.

3.1 Generating Efficient Codes

For high performance and reliable storage of files on
a LAN, intended to replace a typical disk server with
several ordinary computer nodes, optimal codes of
relatively small size (n, m = O(10)) are necessary.
This scale is set on the one side by the fact that
such disk servers normally have about ten times faster
network links and hard disks, which can be matched in
a distributed storage by downloading several coded
blocks in parallel, and on the other side by the fact
that a large n + m introduces an overhead on the
network as well as for the organisation of the system.
While very small optimal codes (n ∈ {2, 3} and
m ∈ {3, 4, 5}) have recently been constructed [13],
larger codes can only be constructed and evaluated by
Monte Carlo techniques [15].

Using such a Monte Carlo technique we create graphs
randomly for a fixed n and m and a probability p for a
right-hand node to be connected to a given left-hand
node. We use 0.4 < p < 0.6 depending on n and m, based
on the findings in [?]. Instead of evaluating the
average overhead factor by sampling the necessary
overhead for many different downloading sequences of
blocks, we compute f_max for the given graph.
Calculating the average overhead with s samples
requires a number E of evaluations of whether the
original data can be reconstructed from the given set
of blocks. The resulting error on f is f/√s. Computing
f_max requires at most

\[ E' = \prod_{i=n}^{n+m-1} i \]

reconstruction tries, where E' < E for small graphs
(n ≪ s). Since in practice most generated graphs will
be unable to cope with even a very small number of
missing coding blocks, and f_max is found as soon as a
single download sequence fails, the average number of
tries is close to n(n − 1) (most graphs failing to
compensate for 2 missing data blocks in all cases).
Figure 3 shows that the performance of the graphs
generated and evaluated in this manner for R = 1/2 can
compete with the best known graphs for n < 15.

Figure 3: Average overhead f(n) of the best generated
codes compared to the overhead of published codes
f_published (taken from [?]) for R = 1/2. In addition
the worst-case overhead factor f_max of the best
generated graphs is shown.
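The Monte Carlo search described in this section can be sketched as follows. This is a simplified illustration under several assumptions: coding blocks depend only on data blocks, the worst case is found by exhaustively trying every set of k lost blocks rather than every download sequence, and all function names are hypothetical. The worst-case overhead then follows from the largest tolerated number of losses k through f_max·n = n + m − k.

```python
import random
from itertools import combinations


def random_graph(n, m, p, rng):
    """Link each of the m coding nodes to each data node with probability p."""
    return [tuple(i for i in range(n) if rng.random() < p) for _ in range(m)]


def recoverable(n, graph, lost):
    """Peeling check. Blocks 0..n-1 are data, block n + j is coding block j."""
    known = set(range(n)) - set(lost)
    coding = [j for j in range(len(graph)) if n + j not in lost]
    progress = True
    while len(known) < n and progress:
        progress = False
        for j in coding:
            missing = [i for i in graph[j] if i not in known]
            if len(missing) == 1:          # exactly one unknown: recover it
                known.add(missing[0])
                progress = True
    return len(known) == n


def max_tolerated_losses(n, graph):
    """Largest k such that ANY k lost blocks still allow reconstruction."""
    total = n + len(graph)
    k = 0
    while k < len(graph):                  # more than m losses can never work
        # all() stops at the first failing loss pattern, as noted in the text
        if not all(recoverable(n, graph, lost)
                   for lost in combinations(range(total), k + 1)):
            break
        k += 1
    return k


def search(n, m, p, trials, seed=0):
    """Generate `trials` random graphs and keep the most loss-tolerant one."""
    rng = random.Random(seed)
    best_graph, best_k = None, -1
    for _ in range(trials):
        graph = random_graph(n, m, p, rng)
        k = max_tolerated_losses(n, graph)
        if k > best_k:
            best_graph, best_k = graph, k
    return best_graph, best_k
```

Most random graphs are rejected after only a handful of checks because a small loss pattern already defeats them, which keeps the search cheap for graphs of this size.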



3.2 Performance of LDPC Codes 

In this section, we evaluate the performance of LDPC
codes of different rates in order to select good values
for m and n. We then use the procedure presented in
section 3.1 to generate a graph with good properties.
We implement this graph in our storage system
prototype and evaluate the overhead of decoding with a
varying number of missing data chunks.

Fig. 4 shows the system failure rate provided by four
different code rates as a function of the number of
data chunks n. Rates like 1/2 and 1/3 provide a low
failure rate at the price of a high storage overhead.
Rate R = 2/3 has a low overhead but a high failure
rate. For example, with R = 2/3 we need n = 14 and
m = 7 to reach a failure rate of 10^{-?}, while the
same availability can be obtained with R = 4/7 (which
has almost the same storage overhead) with n = 8 and
m = 6. For our availability goals, rate R = 4/7
provides a good tradeoff between storage overhead and
availability. We use n = 8 and m = 6.

To generate a good graph for the parameters n = 8 and
m = 6, we ran the graph generation algorithm presented
in section 3.1. The resulting graph is shown in
fig. 5. It has the property of tolerating the loss of
any three data chunks, that is, the file can always be
reconstructed even if any three data chunks are
missing.

Reconstructing a file out of both control and data
chunks has some overhead because it requires
computations instead of a simple concatenation of
chunks. We evaluated this overhead by decoding files
with a variable number of missing chunks. The tests
were run on the Münster testbed. Tab. 1 shows the data
rates obtained when downloading chunks from several
servers, as a function of the number of failing data
chunks. The file size is 500 MB. It was encoded using
the graph shown in fig. 5. Since this graph allows a
file to be reconstructed with up to three missing
chunks out of fourteen, we vary the number of missing
data chunks from zero to three. The client also ran on
the Münster cluster (see sect. 5.1 for a description
of the cluster). We show here the average of 50
downloads.

There is an obvious cost in decoding missing data
chunks out of control chunks. However, the performance
is still acceptable, even in the worst case of three
missing chunks. Considering that the worst case will
not occur often in a LAN, the overhead in general
would be much smaller.

Table 1: Performance of reading a 500 MB file as a
function of missing data chunks.

Figure 4: Failure rate as a function of the number n
of data chunks for different code rates. We assume a
node availability of μ = 0.95.

Figure 5: An example of a graph with n = 8 and m = 6
from which the file can be reconstructed with any 11
nodes. The average overhead factor is f = 1.108.



4 A Prototype Implementation 

A prototype of the system presented in this article
has been implemented. This section describes the
implementation.

4.1 Overall Architecture of the System

Files will be distributed to servers running on nodes 
involved in the system. Applications using the system 
are linked to a client library. 

Servers The system is deployed over a set of nodes
in a computing center. Each host runs an instance
of the server. The server is responsible for
hosting file chunks, distributing them to clients,
and receiving them when a client stores a file in
the system. We use HTTP both for file transfers
and for control messages.
We associate with each server a unique identifier
obtained by computing a hash on the name of the 
host it is running on. This implicitly defines an 
order on servers. When looking for a file chunk 
(either for storing it or retrieving it), the same 
hash function is computed on the chunk name 
and the server having the closest hash value is 
identified as the one hosting it. 

Clients Access to files is implemented in a C++ 
client library that applications link against. The 
library currently provides one call to make a lo- 
cal copy of a stored file or to copy a local file 
to the system. Files are identified by a flat file
name. We do not intend to implement a file 
system interface to the system but rather a "file 
server" interface. 

When a client is started, it needs to update its local 
list of hosts in the system so that it will be able to get 
file chunks from them. For bootstrapping, clients are 
configured manually with a list of well-known hosts 
that they contact at startup to obtain an updated 
list. Clients store their list persistently, so that when 
they are restarted they can quickly reestablish con- 
tact with existing hosts. The list of hosts is kept 
consistent using a peer-to-peer overlay network. See 
sect. 4.4 for more details.

4.2 Storing a File

File chunks are randomly stored on a set of hosts. To
store a file, the client first splits and encodes it
into n + m chunks and assigns to each a name
consisting of the original file name plus the index of
the chunk. It then computes the hash of each chunk's
name. Each chunk is stored on the server whose
identifier is closest to the hash. Using a hash
function enables chunks to be randomly distributed in
the system. To read a file, the client can contact the
hosts storing the chunks directly, knowing only the
chunk names.
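The placement rule can be illustrated with a short Python sketch. Several details are assumptions, since the text does not specify them: SHA-1 as the hash function, distance measured around a ring of identifiers, and chunk names of the form `<filename>.<index>`. The host names are hypothetical.

```python
import hashlib

RING = 2 ** 160  # size of the SHA-1 identifier space (assumed hash function)


def ident(name: str) -> int:
    """Identifier of a host name or chunk name."""
    return int.from_bytes(hashlib.sha1(name.encode("utf-8")).digest(), "big")


def ring_distance(a: int, b: int) -> int:
    """Distance between two identifiers on the ring (assumed metric)."""
    d = abs(a - b)
    return min(d, RING - d)


def chunk_names(filename: str, n: int, m: int):
    """File name plus chunk index; the separator is an assumption."""
    return [f"{filename}.{i}" for i in range(n + m)]


def place(filename: str, n: int, m: int, hosts):
    """Map every chunk name to the host whose identifier is closest."""
    host_ids = {h: ident(h) for h in hosts}
    return {
        chunk: min(hosts, key=lambda h: ring_distance(host_ids[h], ident(chunk)))
        for chunk in chunk_names(filename, n, m)
    }


if __name__ == "__main__":
    hosts = [f"node{i:02d}.example" for i in range(50)]
    for chunk, host in place("data.root", 8, 6, hosts).items():
        print(chunk, "->", host)
```

Because the mapping depends only on the chunk name and the host list, any client holding the same host list computes the same placement without contacting a central directory.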

In case of failure of some destination nodes, the
current implementation discards the failing chunks,
not storing them on the system. If no more than three
chunks are missing, clients will still be able to read
the file from the available ones. This is of course
not the ideal solution and we are working on
mechanisms for tolerating these failures. This
question is addressed in sect. 4.5.

4.3 Retrieving a file 

In order to download a file, the client computes the
hash of each chunk's name to determine the hosts
storing each chunk. Initially, it tries to download
only the n data chunks. Unless some nodes are
unreachable or some chunks are not found, these n
chunks are sufficient to reassemble the file very
efficiently, as they only have to be concatenated. In
case of failure, the client uses the graph to find the
control chunks required and downloads them instead. It
is possible that some control chunks fail as well, in
which case the graph may allow them to be rebuilt from
other chunks. If the available chunks do not allow the
file to be reconstructed, that is, if there are more
than three missing chunks, the call fails.
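The download strategy above amounts to a small planning step: take every reachable data chunk, then add only those control chunks that the graph says are useful. The following sketch is an illustration under assumptions, not the prototype's logic: chunks 0..n-1 are data chunks, chunk n + j is control chunk j, `graph[j]` lists the data chunks combined into control chunk j, and rebuilding failed control chunks from other chunks is not covered.

```python
def plan_download(n, graph, reachable):
    """Return the chunk indices to fetch, or None if the file cannot be rebuilt.

    reachable -- set of chunk indices whose hosts currently answer
    """
    wanted = [i for i in range(n) if i in reachable]   # data chunks first
    known = set(wanted)
    progress = True
    while len(known) < n and progress:
        progress = False
        for j, edges in enumerate(graph):
            if n + j not in reachable:
                continue                               # control chunk is down
            missing = [i for i in edges if i not in known]
            if len(missing) == 1:
                if n + j not in wanted:
                    wanted.append(n + j)               # this control chunk helps
                known.add(missing[0])
                progress = True
    return wanted if len(known) == n else None
```

When all data chunks are reachable the plan contains no control chunk at all, which corresponds to the fast path of plain concatenation.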

HTTP provides a way of obtaining a slice of a file by
specifying a start and an end offset. We use this to
switch from a failing server to another in the middle
of a file transfer. However, the client API does not
yet provide a call for downloading a specific part of
a file: the file is always downloaded entirely. Such a
call would allow the client library to skip chunks
which are not needed (in case the data length allows
it).
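The slice mechanism referred to here is the standard HTTP `Range` request header. As a minimal sketch of how a transfer could be resumed from another server, the snippet below uses Python's `urllib.request`; the prototype itself is written in C++, and the URL layout and function names are hypothetical.

```python
import urllib.request


def range_request(url: str, start: int, end: int) -> urllib.request.Request:
    """Build a GET request for bytes start..end (both inclusive)."""
    return urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})


def resume_request(alternative_url: str, already_received: int, chunk_size: int):
    """Request the remaining bytes of a chunk from another server."""
    return range_request(alternative_url, already_received, chunk_size - 1)


if __name__ == "__main__":
    # Hypothetical chunk URL; 1024 of 4096 bytes arrived before the failure.
    req = resume_request("http://node02.example/chunks/data.root.3", 1024, 4096)
    print(req.get_method(), req.full_url, req.get_header("Range"))
```

A server that honours the header answers with status 206 (Partial Content) and only the requested bytes, so the client can append them to what it already holds.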

There are several interesting strategies for
downloading a file in a robust manner. In the current
implementation, we have only implemented recovery
based on the graph. However, other optimisations for
improving the availability of files, with strategies
implemented in both clients and servers, can be
envisaged and are presented in sect. 4.5.

4.4 Consistency of the host list 

Having a client contact the right servers for getting 
the right file chunks strongly relies on it initially ob- 
taining a consistent list of hosts. We ensure consis- 
tency of the host list using lightweight group com- 
munication between servers. When a client starts, it 
asks any server for this list and uses it to download 
chunks. 






We use weak consistency for maintaining the host
list. Although strong consistency would be desirable,
we consider it more important to provide good
scalability and accept some temporary
inconsistencies. The protocol is designed in such a
way that if the system is left in a stable state, the
information on the hosts will converge (eventual
consistency).

An interesting feature is that although weak
consistency means that a client can start with a
wrong host list, this does not necessarily imply that
the service becomes unavailable. After the system has
been running for some time, there might still be some
hosts which have a slightly inconsistent host list and
would then provide slightly wrong information to
clients. A client receiving the host list from one of
these nodes would then fail to download some chunks,
not because of node failure but simply because it
contacts the wrong nodes. In this case, it would still
run its recovery procedure and hopefully download
control chunks successfully in order to rebuild the
file.

In a similar way the whole system may still provide
100% availability with some chunks permanently lost.
The same applies when the host list that clients start
with is slightly inconsistent.

Host list consistency is implemented using a rumour
mongering protocol which combines pushing and pulling
of rumours. At random times, a server sends (pushes)
to a subset of its known peers information like "host
H1 joined" or "host H2 left". The information is then
forwarded to further nodes with a decaying
probability. Additional information on host
availability is gained by the rumour mongering process
itself (hosts contacting each other are in fact
pulling information on their availability).
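The push half of such a protocol can be simulated in a few lines. This is only a sketch under assumed parameters: the fan-out of 3, the halving of the forwarding probability at each hop and the cut-off below 1% are illustrative choices, not values from the prototype, and the pull half is not modelled.

```python
import random


def spread_rumour(hosts, origin, fanout=3, decay=0.5, min_prob=0.01, seed=0):
    """Simulate pushing one rumour; return the set of hosts that heard it."""
    rng = random.Random(seed)
    informed = {origin}
    pending = [(origin, 1.0)]              # (host, probability that it forwards)
    while pending:
        host, prob = pending.pop()
        if prob < min_prob or rng.random() >= prob:
            continue                       # this host keeps the rumour to itself
        others = [h for h in hosts if h != host]
        for peer in rng.sample(others, min(fanout, len(others))):
            informed.add(peer)
            pending.append((peer, prob * decay))   # forwarding decays per hop
    return informed


if __name__ == "__main__":
    hosts = [f"H{i}" for i in range(50)]
    reached = spread_rumour(hosts, "H0")
    print(f"{len(reached)} of {len(hosts)} hosts heard that H0 joined")
```

A single push round generally does not reach every host, which is why repeated rounds and the implicit pulling described above are needed for the host lists to converge.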

Assuming the frequency at which nodes enter and leave
the system is small compared to the rate at which
information is gossiped in the system, the information
converges to a stable and consistent state.
Convergence speed is important when the system is
started and can be improved by tuning the gossiping
rate. However, after the system has been running for a
long period, inconsistencies should be harmless, as
explained above, and a stable service should be
available.



4.5 Fault tolerance 

We have explained how the system achieves robust- 
ness by using LDPC codes for encoding files. There 
are several interesting optimisations to be put in 
place concerning fault tolerance which are explained 
in this section. 

Reconstruction of lost blocks Servers can decide
to host chunks that have gone missing because of
another host's failure. When a server disappears,
all the chunks it is hosting are lost. Due to the
erasure coding, this does not necessarily entail
loss of files. However, for improved reliability,
the host which is closest in the host hash ring
could recreate the chunks from available ones so
that they would be found again. This should not be
done systematically. However, once the failed host
has definitively dropped out of the host list and
the neighbouring server keeps being polled for the
lost chunks, that server would take the initiative
of hosting them.

Client's host list In the current implementation, 
the client library gets a host list from the first 
available node and uses it to store files. This 
may be an inefficient strategy when the system 
is bootstrapping and many nodes have an incor- 
rect host list. We are implementing checks in 
the client library by getting several lists simul- 
taneously in order to increase the validity of the 
initial host list.

Uploading failing chunks to alternative hosts

We have explained above that the process of
writing files is implemented in a rather
optimistic way: if a destination host is failing,
the chunk is simply not stored. Instead of
requiring the client to obtain a more up-to-date
host list and store the chunks normally, one
strategy could consist of storing the chunks on
the hosts which would take over from the failing
servers. This would increase availability without
waiting for these servers to reconstruct the
missing chunks themselves.

Load balancing When a server is overloaded it
could refuse the download requests from some
clients. These clients could then contact another
server to download an alternative control chunk.
The client would need to decode the file from the
control chunk or even have to download more than n
chunks, but this might still be faster if the
transfers can complete more quickly because the
load is better balanced. This load balancing
strategy requires a very efficient implementation
of decoding, so that clients can still reconstruct
the file quickly.

4.6 File server operations 

We have chosen to provide access to the system in a
similar way to a file server, with broad file access
and simple semantics. Features mandatory for file
access are not the topic of this paper, which
concentrates on the use of LDPC codes for storing
files.

File inventory Currently the system itself does not 
know which files are stored. Instead we rely on 
an external catalogue to keep an inventory of the 
stored files. 

Access control The software here does not include 
sophisticated file access (access control, file up- 
dates, etc.). We intend to use this system in pro- 
duction and this type of feature will be quickly 
implemented. 

5 Performance 

In this section, we show performance measurements
obtained with our prototype implementation. The main
goal of these measurements, carried out on the
testbeds described below, is to get an idea of the
overall data rate one can achieve with a system like
the one presented here. Deeper studies regarding
availability are ongoing work, which we intend to
carry out through simulations instead. This would
allow us to evaluate the robustness of the system
under various failure patterns. Later, we shall run
the system in a production-like manner.

We present the testbeds in sect. 5.1 and measurements
are shown in sect. 5.2. LDPC codes are known to be
efficient in terms of decoding. The overhead of
downloading a file with more or less reconstruction
involved is measured in a separate section dedicated
to the performance of LDPC codes (sect. 3.2).

5.1 Testbeds 

We ran the prototype on two different platforms. 

CERN "lxplus" computing farm is a cluster of
about 100 dual Xeon 2.8 GHz machines with 2 GB of
RAM and Fast Ethernet access to the network. We
used 40 of them.

Münster cluster is a set of 50 dual Opteron 2 GHz
machines with 2 GB of RAM. They are interconnected
with a Gigabit network.

On each node, we run a server. Then we store files to 
the system by sending chunks to servers. When the 
files have been stored, we start a client on each of the 
nodes. 

Clients are configured to download all the files in a
random order so that at any time they access different
files. They are also started one after another, so
that we obtain measurements for different numbers of
clients running in parallel.

Both clusters are used in production and many nodes
were very busy with computations during our tests.
This is not a drawback because our system is intended
to run under such conditions in practice.

5.2 Measurements 

Measurements are shown in fig. 6 and 7. We plot in
fig. 6 the aggregated rate of the whole system as a
function of the number of clients running. In fig. 7
we plot the equivalent data rate it provides per
client.

Figure 6: Aggregated rate obtained on both clusters
as a function of the number of client nodes.

Figure 7: Rate per node obtained on both clusters as
a function of the number of client nodes.

Aggregated data rate The overall rate measured on
both clusters is plotted in fig. 6. We see that in
both cases the rate increases each time a client
enters the system. For more than 10 to 12 clients,
the overall rate stops growing. On lxplus, the
maximum rate is about 110 MB/s. It is 350 MB/s on
the Münster cluster.

Rate per client It is shown in fig. 7. The rate per
node obtained on lxplus remains quite constant for
a number of clients between 1 and 12. It is about
10 MB/s. For a high number of clients, the rate
decreases to 3 MB/s. On the Münster cluster, the
rate per node decreases from 80 MB/s to 7.15 MB/s
for 50 clients.
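As a rough consistency check between the aggregated and per-client figures, the saturated aggregate rate can be divided by the number of clients. This assumes that all 50 Münster nodes, and all 40 nodes used on the CERN farm, act as clients at the highest load; the text does not state the client count for the CERN measurement explicitly.

```latex
\[
\frac{350\ \text{MB/s}}{50\ \text{clients}} = 7\ \text{MB/s per client},
\qquad
\frac{110\ \text{MB/s}}{40\ \text{clients}} = 2.75\ \text{MB/s per client} \approx 3\ \text{MB/s}.
\]
```

Both values agree with the quoted per-node rates of 7.15 MB/s and 3 MB/s to within the precision of the rounded aggregate numbers.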

For a large number of hosts, the whole set of servers 
and clients obviously hits a limit of the cluster which 
depends on the average CPU, disk and network usage. 

For a low number of hosts, although the rate per node
decreases when increasing the number of clients, the
loss in performance is relatively low compared to the
number of nodes. On lxplus, the gain in performance
per node is about 9.78 MB/s. On the Münster cluster,
the increase in performance per node is about
50 MB/s. While performing these measurements, we
noticed that the increase in rate is lower than one
would expect: we know that TCP transfers can run at a
higher rate on this type of hardware.

We believe this is a consequence of some nodes being
more loaded than others and of the fact that we keep
connections synchronised to avoid storing chunks on
disk. When run alongside CPU-intensive tasks on the
same system, the performance of both sending and
receiving data drops significantly. For example, when
running tests on the Münster cluster, we noticed that
a simple file transfer can run at only 12 to 20 MB/s
from certain busy nodes. Running systematic tests with
several TCP transfers in parallel, we have observed
that these connections do not only suffer from running
on a busy node. The fact that we keep chunk downloads
synchronised (so that decoding is performed in memory)
leads the well performing connections to send traffic
in bursts whenever a block has been decoded. Such a
pattern further reduces the already low speed of slow
connections, because they get interrupted often and
system timeouts make them restart in slow-start more
often. The rate of a whole file download is affected
by this, as we observed in system tests while
preparing these experiments.

We are investigating the problem but we consider 
the numbers shown in this section promising. In fact, 
since we intend to implement this storage system on 
commodity computing clusters, we expect data rate 
to be affected more strongly by other factors like 
other tasks running on the nodes, the disk usage, and 
network usage. The most critical goal is to ensure 
high availability of the service. 



6 Related Work 

Reed-Solomon codes have been used by several storage
systems, both for WAN environments (OceanStore [?])
and for LANs (RepStore [?] and FAB [3]). All these
systems use erasure coding only for archival storage,
since Reed-Solomon codes have a significant
computational overhead. For frequently accessed files
and for supporting updates, these systems rely on
replication. Many other distributed storage systems
rely solely on replication, including Gnutella [?],
CFS [?] and PAST [?] for WANs, and Petal [?] and the
Google File System (GFS) for LANs. In contrast to
these systems, we rely solely on erasure coding by
using LDPC codes. This is possible due to the LDPC
codes' near real-time decoding speed, which allows us
to have space efficiency without sacrificing
performance.



LDPC codes have not been explored as much for storage
applications. The most significant example of their
use is the Digital Fountain system [?], where LDPC
codes are used for the dissemination of bulk data to a
large number of receivers over Wide Area Networks.
There is also some recent work [?] that studies the
suitability of LDPC codes for Wide Area Network
storage. But to our knowledge, there is no previous
work on the use of LDPC codes for storage in a Local
Area Network environment, where the focus is on
performance and a small storage overhead.



Our system is based on a peer-to-peer topology due to
its fault-tolerance and scalability properties. For
the same reasons, many other storage systems use
peer-to-peer or decentralised topologies. On the Wide
Area Network some examples include Gnutella, CFS, PAST
and OceanStore. Gnutella is based on an unstructured
topology, while the other three systems are structured
using a distributed hash table. For Local Area
Networks, xFS [?], Petal and FAB are all decentralised
systems that rely on voting and consensus algorithms
for organising their topology. Also for the Local Area
Network, RepStore uses a one-hop distributed hash
table. In contrast to these systems, our gossiping
protocol is more lightweight and scalable, having as a
drawback the possibility of temporary inconsistencies.



7 Conclusion and Future Work 

We have presented a novel architecture for a reli- 
able, high performance, distributed storage system 
on a commodity computing cluster. Storage of files 
is based on erasure coding with small Low-Density 
Parity-Check (LDPC) codes. These codes provide 
high reliability given a low storage and performance 
overhead. The main contributions of this paper are: 

• an analytic evaluation of the availability pro- 
vided by LDPC codes versus replication and era- 
sure coding, 

• a way of constructing small LDPC codes with 
good guarantees on their redundancy, 

• the description of an implementation of a file 
storage system based on LDPC encoding and 
performance measurements obtained with it on 
two different computing clusters of both the 
overall rate it provides and evaluation of the 
overhead of decoding. 

The availability provided by LDPC encoding techniques
makes it a satisfactory redundancy scheme for the
implementation of a storage system on a computing
cluster. Our work on generating small graphs allows us
to obtain a good availability of the service against
possible failures of nodes. The initial performance
results are promising.

The work presented here is ongoing work and many 
interesting details are under study. 

• Techniques regarding LDPC codes are still being
investigated. We continue our work on generating good
graphs with more sophisticated ways of controlling
the probability distribution of the edges in the
graphs, as proposed in [?].

• So far we presented an analytic evaluation of the 
availability provided by the use of LDPC codes. 
In the future we intend to use simulations of the 
entire system instead, so that we can study var- 
ious failure patterns, e.g. introduced due to fail- 
ures in the peer-to-peer overlay. 
• The implementation itself is still at an early
stage. We mainly provide a recovery based on 
the underlying LDPC graph. Using the peer-to- 
peer overlay it would be also possible for the sys- 
tem to actively recover missing blocks. In fact, 
given the nature of the coding graphs this would 
involve only a small number of hosts which have 
blocks that are related to the missing one. 

• There is currently no load balancing done by the
implementation apart from the trivial case in which
nodes become unavailable due to their load, such that
blocks are taken from elsewhere. However, the servers
could also distribute information about their load
through the P2P network and actively reroute clients
or initiate further replication.

• Since we intend to use this system in production, in
particular on grid computing sites, part of our
activity will be dedicated to its integration into
grid file catalogues, which will also allow access
controls to be implemented.